Abstract

We describe an end-to-end generative approach for the
segmentation and recognition of human activities. In this
approach, a visual representation based on reduced Fisher
Vectors is combined with a structured temporal model for
recognition. We show that the statistical properties of
Fisher Vectors make them an especially suitable front-end
for generative models such as Gaussian mixtures. The system
is evaluated both for the recognition of complex activities
and for their parsing into action units. Using a variety of
video datasets ranging from human cooking activities to
animal behaviors, our experiments demonstrate that the
resulting architecture outperforms state-of-the-art
approaches for larger datasets, i.e. when a sufficient amount
of data is available for training structured generative models.



1. Introduction 

The growing need for automated video monitoring and
surveillance systems is quickly reshaping our research
landscape. Much of the current research on action recognition
has focused on semi-realistic problems such as categorizing
short clips consisting of a single action (e.g. kick, pour,
throw, pick). However, many real-world applications will
require methods that can solve more realistic problems,
including the recognition and parsing of complex activities in
long continuous recordings, often consisting of sequences
of goals and sub-goals.

Most successful approaches to action recognition have
typically relied on unstructured models of video sequences.
A holistic visual representation is usually computed over an
entire video clip and then passed to a discriminative
classifier to yield a single categorization label per video.
These methods have been successful for the recognition of
single-action video clips (see e.g. [34]). However, they do not
appear to be well suited for the recognition of daily activities
that require the modeling of complex behavior sequences.

Several extensions of these unstructured models have
been proposed to try to address this challenge. One popular
approach relies on sliding (temporal) windows whereby
videos are decomposed into a sequence of shorter segments
that can be individually classified with discriminative
approaches GIEID. However, these approaches have, for
the most part, only been tested on a handful of relatively
small datasets that do not capture the rich and diverse
nature of daily activities. As we will show, these approaches
are not competitive on more challenging activity datasets.

Figure 1: Segmentation and recognition of human activities
with a) the ADL dataset (“dial phone”), b) the Breakfast
dataset (“prepare fried eggs”) and c) the MPII cooking
dataset (“prepare soup”).

Structured temporal models, on the other hand, have
reached an impressive level of maturity in several
engineering domains and speech recognition [36] in particular.
These models would appear more appropriate than their
unstructured counterparts for the recognition of human
activities. Somewhat surprisingly, relatively little effort has
been devoted to adapting these approaches to human action
recognition (but see e.g. El El). One of the main reasons
why structured generative methods have not found more
widespread acceptance in action recognition is that, unlike
for speech analysis where large annotated corpora are
available, video databases have been comparatively limited in
size (m.
With the emergence of larger video datasets (e.g.
CRIM13 m and Breakfast), these models are more
likely to start exhibiting competitive performance. For
instance, encouraging results were obtained in [12] using
Hidden Markov Models (HMMs) combined with a context-free
grammar to learn cooking activities. One of the main
limitations associated with standard HMM toolboxes (such as the
HTK used in [12]) is the use of Gaussian mixtures, which
typically require input data to be normally distributed as
well as low-dimensional to prevent overfitting. Standard
visual representations such as Bag-of-Words or Fisher Vectors
(FVs) thus constitute a poor choice for HMMs and other
generative approaches, because they typically yield sparse
and high-dimensional visual representations.

Here, we describe an approach for the construction of
reduced FVs which is particularly amenable to structured
temporal models. FVs have been shown to achieve
state-of-the-art accuracy in action recognition na. They have
also been shown to maintain good classification accuracy
when used in conjunction with dimension reduction
techniques (Sim. This makes them good candidates
for modeling by Gaussian mixtures. As we will show, the
proposed approach yields a very substantial improvement
in recognition accuracy on a variety of activity
segmentation and recognition tasks, ranging from the recognition of
human daily activities to the segmentation of rodent social
interactions.

To summarize, we describe an approach to improve
the efficiency of state-of-the-art feature encoding
methods lam that are especially amenable to generative
models. We systematically evaluate the proposed approach
using a variety of standard activity datasets and demonstrate
significant improvements for datasets that contain sufficient
training data.

2. Related work 
2.1. Fisher vectors 

Fisher kernel methods were originally proposed as a way
to derive kernels for discriminative classifiers from
generative models [9]. They were later adapted to represent
feature sets used for image classification CD. The application
of L2 and power normalizations combined
with a method for sampling FVs based on a spatial pyramid
were then shown to significantly improve their
accuracy [20]. More recently, FVs have been shown to yield
not only higher classification accuracy, but also much more
compact feature vectors CD.

The application of FVs to action recognition was first
explored in ED, where the authors used a standard video
descriptor (HOGHOF) to compare different encoding
methods on two different datasets. It was shown that FVs often
outperform other methods, a result that was further
replicated in a separate study ED. The combination of FVs
and Dense Trajectory Features (DTFs) was also
demonstrated to work exceedingly well for the recognition of
actions [34, 13]. All the aforementioned approaches are based
on discriminative classification methods trained on (short)
single-action pre-segmented video clips. We are not aware
of previous work focusing on the statistical properties of
FVs in the context of generative action recognition
models.

2.2. Structured temporal models 

Most early approaches for action recognition with
structured temporal models relied on either motion capture
data [8, 28] or hand-labeled trajectories ED. Several
temporally structured models have since been applied to video
data, including generative mixture models M, Bayes
Networks [25] and an HMM/SVM combination E).

More recent work has focused on the problem of
detecting and segmenting human activities in videos. In [27], a
semantic scene label map was built as context for agent
actions to automatically learn AND-OR grammars from
videos. In (D, Linear Dynamical Systems theory was used
to detect events in complex video datasets. Long-term
relations were also considered in the “sequence memorizer”
described in O, which uses a Bayesian nonparametric model
to simultaneously detect and classify events within a video
stream. A similar idea is proposed in the work of [12],
using a context-free grammar in combination with HMMs to
model longer temporal sequences of smaller action units.
In [7], activity models were based on the detection of
changes in state-specific regions of interest (e.g. the lid of
a coffee jar for “opening coffee jar” and “closing coffee jar”
actions). The authors used SVM-based state detectors to
detect the beginning and end of short task-oriented action
units such as “hold spoon” or “stir coffee”.

A higher-level representation based on stochastic
context-free grammars was used in [32], where body pose
information (i.e. hand positions) was used for classification.
A closely related approach was proposed in [21], where
action units were combined with a set of production rules to
build a grammar to model the hierarchical temporal
structure of human activities. The system was able to learn and
parse action units derived from the Olympic Sports dataset.

Here, we build on our earlier work [12] using HMMs
combined with a simple grammar to model complex human
activities as sequences of action units.

3. System description 
3.1. Fisher vectors 

[Figure 2 panels. Sample data w/o PCA: mean = 0.0, std. dev. = 0.2,
median = 0.0; Lilliefors: p = 0.001, Jarque-Bera: p = 0.001.
Sample data with PCA: mean = 0.0, std. dev. = 1.0, median = 0.0;
Lilliefors: p = 0.411, Jarque-Bera: p = 0.025.
Results of normality test for FV data.]

Figure 2: Distribution of FV samples before and after PCA
and results of normality test (Lil = Lilliefors, Jb =
Jarque-Bera) with decreasing significance levels for FV samples
before and after PCA.

We briefly review the key steps involved in FV
computation and frame-based action recognition. We refer the
reader to [26] for a more detailed description. The main
assumption behind FVs is that local feature descriptors may
be modeled by a probability density function. Here, we
consider a Gaussian Mixture Model (GMM) with $K$
components defined by the associated mixture weights, mean
vectors $\mu_k$ and variances $\sigma_k^2$. FVs characterize how a
feature set $X = \{x_t \mid t = 1, \ldots, T\}$ deviates from a learned
distribution. For each feature set $X$, the resulting gradients
$\mathcal{G}^X_{\mu,k}$ and $\mathcal{G}^X_{\sigma,k}$ each have the dimensionality $D$ of the
original feature descriptor, and they are computed for each
mixture component of the GMM.
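
For reference, the gradients with respect to the GMM means and
variances are commonly written as below in the FV literature.
This is a sketch under assumptions: the equations are not
reproduced in this text, and the symbols $w_k$ (mixture weight of
component $k$) and $\gamma_t(k)$ (soft assignment of descriptor $x_t$ to
component $k$) are introduced here for illustration. Divisions and
squares are element-wise over the $D$ descriptor dimensions.

```latex
\gamma_t(k) = \frac{w_k \, \mathcal{N}(x_t; \mu_k, \sigma_k^2)}
                   {\sum_{j=1}^{K} w_j \, \mathcal{N}(x_t; \mu_j, \sigma_j^2)}

\mathcal{G}^X_{\mu,k} = \frac{1}{T \sqrt{w_k}}
    \sum_{t=1}^{T} \gamma_t(k) \, \frac{x_t - \mu_k}{\sigma_k}

\mathcal{G}^X_{\sigma,k} = \frac{1}{T \sqrt{2 w_k}}
    \sum_{t=1}^{T} \gamma_t(k) \left[ \frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1 \right]
```

Each gradient is a $D$-dimensional vector, so stacking both for all
$K$ components gives $2 \times D \times K$ values per feature set.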


3.2. Normality test 


The HTK recognition framework used here (see
Section 3.3), like most other systems for automated speech
recognition, relies on HMMs with observation probabilities
modeled by Gaussian mixtures. Higher-dimensional
Gaussian mixtures are prone to overfitting, especially when given
only a limited amount of training data. This can be
compensated to a certain extent by reducing the number of mixtures
used. In general, we found that the best results were obtained
with one Gaussian per state, which is consistent with the
practice reported in [12]. It is thus highly desirable for
input data to be normally distributed.

In order to test the normality of FVs for video data,
we considered different normality tests. To evaluate how
dimensionality reduction using PCA affects the normality
of the resulting feature vector, we randomly sampled data
along each dimension of the feature vectors and tested the
skewness and kurtosis of the resulting distributions using
the Lilliefors and the Jarque-Bera test cni,
respectively. We tested the null hypothesis that a given dimension
is normally distributed and estimated the number of
dimensions for which the null hypothesis is not rejected (for
decreasing significance levels in the range 0.5-0.001). We
applied this test to FV samples before and after PCA. Results
shown in Figure 2 confirm that PCA yields distributions that
are closer to a normal distribution. For instance, at a
significance level of α = 0.001, a mere 0.53% of the original
FV dimensions pass the Lilliefors test (none for Jarque-Bera),
whereas 84.3% (79.6% for Jarque-Bera) of the PCA-reduced
data dimensions pass the test. This is already quite evident
when we consider the first dimension of the feature
vector (before and after PCA), as shown in Figure 2. For
comparison, we also tested the BoWs as used in our
previous work [12]. Here, the null hypothesis was always
rejected, irrespective of the significance level, suggesting that
none of the dimensions are normally distributed.

Overall, PCA helps to build a feature vector that better
fits the normality assumption of the proposed HMM-based
model. As we will show in Section 4, this yields
significant gains in activity recognition accuracy.
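
The per-dimension test described in this section can be sketched
as follows. This is a minimal illustration, not the authors'
implementation: it covers only the Jarque-Bera test (the
Lilliefors test needs tabulated or simulated critical values), it
uses the asymptotic chi-square approximation for the p-value, and
the function names and toy data are assumptions made for this
example.

```python
import math
import statistics


def jarque_bera(samples):
    """Return (JB statistic, asymptotic p-value) for a 1-D sequence."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under the null hypothesis JB is asymptotically chi-square with
    # 2 degrees of freedom, whose survival function is exp(-x / 2).
    return jb, math.exp(-jb / 2.0)


def fraction_not_rejected(features, alpha):
    """Fraction of feature dimensions for which normality is not
    rejected at significance level alpha (features: list of vectors)."""
    columns = list(zip(*features))
    passed = sum(1 for col in columns if jarque_bera(col)[1] > alpha)
    return passed / len(columns)


# Deterministic toy data: one near-normal and one skewed dimension.
n = 1000
normal_like = [statistics.NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
features = list(zip(normal_like, skewed))
print(fraction_not_rejected(features, alpha=0.001))
```

Sweeping `alpha` over the range 0.5-0.001 and plotting the returned
fraction gives the kind of curve summarized in Figure 2.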


3.3. A generative recognition pipeline 


The concatenation leads to an overall $2 \times D \times K$
dimensional FV representation $\mathcal{G}^X$ of the original feature set $X$
with $\mathcal{G}^X = [\mathcal{G}^X_{\mu,k}, \mathcal{G}^X_{\sigma,k}]'$. Following 1^ , we applied an
L2-normalization to these vectors. Additionally, the authors
in ll^ observed that the more Gaussian components are
used, the sparser the FVs become. We followed their
suggestion to use a power normalization ($g(x) = \mathrm{sign}(x)\sqrt{|x|}$)
to reduce the sparsity of the FVs. As the resulting FVs are
too high-dimensional to be processed in a generative
framework, we used PCA to reduce the overall dimensionality of
the feature vector m and to further whiten the data.
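
The two normalizations can be sketched in a few lines. This is an
illustrative sketch only: the order (power normalization first,
then L2) follows common FV practice and is an assumption here, as
are the function names and the small `eps` guard against
all-zero vectors.

```python
import math


def power_normalize(vector):
    """Signed square root, g(x) = sign(x) * sqrt(|x|), per element."""
    return [math.copysign(math.sqrt(abs(x)), x) for x in vector]


def l2_normalize(vector, eps=1e-12):
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


fv = [4.0, -9.0, 0.0, 0.25]
print(l2_normalize(power_normalize(fv)))
```

The signed square root compresses large magnitudes more than small
ones, which is why it makes a sparse vector with a few dominant
entries less peaked.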


In the following, we briefly give an overview of the
pipeline used (see Figure 3). We used an improved
version of the Dense Trajectory Features (DTFs) [33] for
datasets with camera motion. The dimensionality of the
feature descriptors was first reduced from 426 dimensions to
64 dimensions by PCA, following the procedure described
in ini.

Figure 3: Overview of the recognition pipeline: DT features are computed and the corresponding descriptor is reduced to
64 dimensions. A total of 200,000 features are randomly sampled and fitted to GMMs (K = 16, 32, 64, 128 or 256). An
FV representation is computed for each frame of the video. The corresponding representation is further reduced from
2,048-32,768 down to 64 dimensions. During training, each HMM is initialized with action unit samples. State boundaries are
re-estimated and the GMMs are updated according to the new state boundaries until convergence. During recognition, HMMs
are combined with a learned context-free grammar and the most probable sequence of action units is determined.

We sampled 200,000 random features to fit the GMMs.
FVs were computed using 50,000 frames sampled from the
training data. For each reference frame, FVs were
computed over a 20-frame sliding window. The dimensionality
of the resulting vector was further reduced to 64 dimensions
using PCA (see Section 3.1). Thus, each frame is
represented by a 64-dimensional FV. We further applied an
L2-normalization to each feature dimension separately for each
video clip.
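
As a quick check of the dimensions quoted for the pipeline, the
size of the unreduced FV follows directly from the descriptor
dimensionality and the number of Gaussians. The helper name below
is hypothetical.

```python
def fv_dimension(descriptor_dim: int, num_gaussians: int) -> int:
    """Length of an unreduced FV: one mean gradient and one variance
    gradient, each of descriptor_dim values, per Gaussian."""
    return 2 * descriptor_dim * num_gaussians


D = 64  # descriptor dimensionality after the first PCA step
for K in (16, 32, 64, 128, 256):
    print(K, fv_dimension(D, K))
```

For K = 16 and K = 256 this gives 2,048 and 32,768 dimensions, the
range that the second PCA step reduces to 64.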

The proposed recognition system contains two main
components: a set of HMMs is used to model all possible
action units found in the dataset and a grammar is used
to model possible sequences of those units. The number
of hidden states for each HMM was set to 1/10 of the
mean length of the corresponding action units. All HMMs
are based on a left-to-right feed-forward topology, allowing
only self-transitions and transitions to the next state. The
initial state transition probabilities were set to default
values (self: p = 0.6, next: p = 0.4). To initialize the state
distribution, we subdivided each action unit evenly over time
and associated each subdivision with a hidden state. Thus,
frames at the beginning or end of an action unit are always
associated with the first and last states, respectively, due to
the left-to-right topology.
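
The initialization described above can be sketched as follows.
This is a simplified illustration rather than the HTK prototype
format: function names are hypothetical, the mean unit length is
assumed to be given in frames, and the last state is made
absorbing here, whereas HTK routes the remaining probability to a
non-emitting exit state.

```python
def num_states(mean_unit_length, ratio=0.1):
    """Number of hidden states: 1/10 of the mean unit length, at least 1."""
    return max(1, round(mean_unit_length * ratio))


def left_to_right_transitions(n_states, p_self=0.6, p_next=0.4):
    """Transition matrix allowing only self- and next-state transitions."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            matrix[i][i] = p_self
            matrix[i][i + 1] = p_next
        else:
            matrix[i][i] = 1.0  # assumption: absorbing last state
    return matrix


def uniform_alignment(n_frames, n_states):
    """Assign each frame of one action unit to a state by splitting the
    unit evenly over time."""
    return [min(n_states - 1, t * n_states // n_frames) for t in range(n_frames)]


print(num_states(50))
print(left_to_right_transitions(3))
print(uniform_alignment(6, 3))
```

The uniform alignment provides the initial frame-to-state
assignment from which the per-state Gaussians can be estimated
before re-estimation.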

During training, unit states were re-estimated using the
Baum-Welch algorithm, i.e. by finding the HMM
parameters that maximise the probability of a given set of
observations. For details concerning training and recognition
with HTK, we refer the reader to lEsiiia. As
the number of samples per class (in our case, per action
unit) follows a long-tail distribution, with few classes being
frequent and a large number of classes being relatively rare,
we enforced a minimum and maximum number of training
samples (see Table 1) for a balanced training data set across
classes. When needed, artificial samples were generated by
synthetic minority over-sampling to guarantee a minimum
number of samples.
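
Synthetic minority over-sampling creates new samples by
interpolating between an existing sample and one of its nearest
neighbours from the same class. The sketch below shows the basic
idea for fixed-length vectors; how it was applied to
variable-length action units is not stated in the text, and the
function name, `k` and the seeded generator are assumptions.

```python
import random


def smote(samples, n_new, k=5, rng=None):
    """Generate n_new synthetic vectors for one class (needs >= 2 samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples by squared Euclidean distance to base.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rare_class = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
print(smote(rare_class, n_new=4, k=2))
```

Classes above the maximum would simply be subsampled; only classes
below the minimum need synthetic samples.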


During recognition, we followed the approach described
in [12], formalizing activity recognition and
segmentation as the problem of finding the most probable sequence
of action units from an observed input sequence. A
context-free grammar was built automatically using available
annotations. For the CRIM13 dataset El), we favored a bi-gram
model which defines the transition probability to the next
possible units instead of absolute paths. This is a richer
model and it is more appropriate for modeling animal
behavior, which tends to be relatively stochastic compared to
the human activities found in other datasets.
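
A bi-gram model of this kind can be estimated by counting which
unit follows which in the annotated sequences. The sketch below
uses maximum-likelihood estimates without smoothing; the function
name and the behavior labels are hypothetical and not taken from
the dataset.

```python
from collections import Counter, defaultdict


def bigram_model(sequences):
    """Estimate P(next unit | current unit) from annotated unit sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {unit: c / sum(ctr.values()) for unit, c in ctr.items()}
        for current, ctr in counts.items()
    }


annotations = [["walk", "sniff", "walk"], ["walk", "eat"]]
print(bigram_model(annotations))
```

Unlike a grammar that enumerates complete unit sequences, this
model only constrains each transition locally, so unseen orderings
of known transitions remain possible.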

The Viterbi algorithm was used to find the most probable
sequence of action units. The output of the algorithm
includes the best matching sequence of action units, their
beginning and end frames, and the corresponding
observation probabilities (see 1^ ).
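
The decoding step can be illustrated with a generic log-domain
Viterbi implementation over a small state space. This is a
textbook sketch, not HTK's decoder (which additionally expands the
grammar into a network of unit HMMs); the function names and the
toy probabilities are assumptions.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_emis, log_trans, log_init):
    """log_emis[t][s], log_trans[i][j], log_init[s] -> (path, log prob)."""
    n_frames, n_states = len(log_emis), len(log_init)
    score = [log_init[s] + log_emis[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, n_frames):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emis[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


def segments(path):
    """Collapse a frame-wise path into (state, first frame, last frame)."""
    result, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            result.append((path[start], start, t - 1))
            start = t
    return result


log = math.log
init = [0.0, NEG_INF]                         # always start in state 0
trans = [[log(0.6), log(0.4)], [NEG_INF, 0.0]]
emis = [[log(0.9), log(0.1)]] * 2 + [[log(0.1), log(0.9)]] * 2
best_path, best_score = viterbi(emis, trans, init)
print(best_path, segments(best_path))
```

The `segments` helper mirrors the kind of output described above:
a label sequence together with the first and last frame of each
segment.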

4. Evaluation 
4.1. Datasets 

Recent years have seen a significant increase in the
availability of public activity datasets. To evaluate the proposed
architecture, we considered complex activity datasets (as
opposed to single task-oriented action) that are labeled at
one or more levels of granularity. The datasets found
suitable for this evaluation included: ADL [14], Olympics [16],
Toy Assembly [32], CMU-MMAC (22|, MPIICooking [23],
50Salads [30], Breakfast [12], and CRIM13 O. Sample
frames for each of these datasets are shown in Figure 4.

The recognition tasks for these datasets typically include
activity classification, action unit detection and
segmentation. The only exception is the Olympic Sports dataset,
where no action unit labeling exists. For this dataset, we
manually labeled 10 clips per class and used these
annotations for initializing the system. We then applied the
recognition scheme to the remaining training clips and used the
system outputs as labels for the training phase.

Figure 4: Sample frames from the datasets used for performance evaluation: a) ADL H^ . b) Olympic [16], c) ToyAssembly [32],
d) CMU-MMAC (291, e) MPIICooking (21, f) 50Salads [30], g) Breakfast [12], and h) CRIM13 d.

Dataset     Duration    Train samples used per class
ADL         40 min      12-30 samples
Olympics    90 min      70-80 samples
Toy         64 min      15-20 samples
CMU         265 min     30-40 samples
MPII        490 min     12-30 samples
50Salads    320 min     30-35 samples
BF          66.7 h      50-70 samples
CRIM13      32.4 h      80-100 samples

Table 1: Overall duration of the different datasets and
number of samples available for training.

Some of the selected datasets provide additional
benefits such as multi-modal signals or multi-view settings. For
this evaluation, however, we only considered video data. All
videos were separately processed and evaluated and we did
not apply any method for combining camera input from
different views. The duration of the datasets and the number
of samples used for training are shown in Table 1.

4.2. System evaluation 

We first compare the accuracy of the proposed reduced
FVs against that of our previous work using HTK in
combination with HOGHOF for the Breakfast dataset
(Table 2). Replacing HOGHOF with DTFs already improves
the overall system accuracy by ~10-14% (Table 2,
HTK+HOGHOF w PCA compared to HTK+DTF w PCA).

Breakfast dataset - FV
GMMs =                           16     32     64     128    256
1) SVM+DTF w/o PCA               52.0   52.6   48.7   39.6   23.2
2) SVM+DTF w PCA (D' = 64)       42.0   42.5   42.8   40.3   41.2
3) HTK+HOGHOF w PCA (D' = 64)    6T3 60 02^ 60J
4) HTK+DTF w PCA (D' = 64)       71.5   72.2   73.3   68.6   66.4

Table 2: Comparison between HTK vs. SVM and
HOGHOF vs. DTFs for activity recognition (in
combination with FV-based encoding on the Breakfast dataset).

To evaluate the impact of the reduced FVs on a
generative vs. a discriminative framework, we compared the
proposed pipeline against one where the HTK classification
stage was replaced with an SVM (for both the full FV
representation with 2,048-32,768 dimensions and K = 16-256
GMMs and the reduced FV representation with 64
dimensions). Classification was based on the libSVM software
library O using a linear kernel. We used identical features
and GMM clusters as in the proposed HTK-based system.
Note, however, that for the SVM baseline, features were
sampled from the entire video sequence because we found
it to work better than the frame-based sampling used for
HTK.

As Table 2 shows, SVM-based classification performs
better when using the full FV representation for
classification compared to the reduced FV representation. However,
the accuracy of the SVM-based classification remains
significantly below the accuracy of the system based on HTK
by ~20-30% with identical features. Our results show
that, compared to the baseline reported in [12], reduced FVs
improve the recognition accuracy by ~20% for HOGHOF
and ~30% for DTFs.

4.3. Segmentation 

Segmentation
GMM =   ADL    Oly.   Toy            CMU            MPII           50Salads     BF              CRIM13
16      53.4   62.4   50.3 / 64.3    53.S / 60.8    46.5 / 5S.5    81.6         36.2 / 54.2     52.6
32      54.5   66.1   48.6 / 65.7    53.1 / 60.7    53.9 / 68.5    80.4         36.9 / 54.4     53.5
64      55.7   67.5   56.7 / 67.5    53.0 / 60.3    51.6 / 63.9    83.8         38.1 / 56.5     53.4
128     58.9   65.9   60.5 / 70.8    52.5 / 60.4    53.9 / 66.8    82.0         34.0 / 57.2     52.6
256     54.4   63.7   63.5 / 72.2    58.8 / 67.7    57.3 / 77.7    83.8         32.1 / 50.7     53.3
Best    -      -      - / 91.0^7r    - / 59.61^     - / 54.3\TSr   67.6 [30]    - / IR.RiVir    39.1 El

Table 3: Overview of the segmentation results for all datasets. Accuracy is computed as the mean over all classes. For
comparison, we also report, as the second value after the slash (italic in the original layout), the frame-based accuracy
for the Toy, CMU and BF datasets, and the midpoint hit accuracy for the MPII dataset, as used by the authors in the
original studies.

To evaluate the segmentation accuracy of the proposed
system, we consider eight different datasets (see Section 4.1
for details). As the original benchmarks for these datasets
are based on different measurements, we report multiple
accuracy measures for fair comparison to these baseline
systems. One measure reported uses the mean accuracy over
all classes (corresponding to the mean accuracy computed
over the diagonal of the corresponding confusion matrix)
as used in [23, 30, 2]. In addition, we also report the
frame-based accuracy (corresponding to the mean proportion of
correctly classified frames) for the Toy, CMU and
Breakfast datasets as used in [32, 12]. For the MPII dataset, we
also report the mid-point hit accuracy as defined in ll^ .
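
The difference between the frame-based accuracy and the mean
accuracy over classes can be made concrete with a small sketch.
It assumes frame-wise ground-truth and predicted labels of equal
length; the function names and labels are hypothetical, and the
mid-point hit measure is not covered here.

```python
from collections import defaultdict


def frame_accuracy(ground_truth, predicted):
    """Proportion of frames whose predicted label matches the ground truth."""
    hits = sum(g == p for g, p in zip(ground_truth, predicted))
    return hits / len(ground_truth)


def mean_class_accuracy(ground_truth, predicted):
    """Mean of the per-class recalls, i.e. the mean of the diagonal of the
    row-normalized confusion matrix."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(ground_truth, predicted):
        total[g] += 1
        correct[g] += int(g == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


gt = ["pour", "pour", "pour", "pour", "stir", "stir"]
pred = ["pour", "pour", "pour", "pour", "pour", "stir"]
print(frame_accuracy(gt, pred), mean_class_accuracy(gt, pred))
```

With imbalanced classes the two measures diverge: frequent classes
dominate the frame-based accuracy, while every class contributes
equally to the mean over classes.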

Segmentation results for the proposed system and
available benchmarks are reported in Table 3. Note that for
the two smallest datasets (ADL and Olympics), no
benchmark is available as no segmentation results have been
previously reported for these datasets. The proposed approach
clearly underperforms the best segmentation
results obtained for the Toy assembly dataset (which
remains a small video dataset with about one hour of video).
For large datasets (8 hours or more of video), the system
significantly outperforms the state of the art in terms of
segmentation accuracy (e.g. BF +21.5%). Note that for
the CRIM13 dataset, the benchmark approach is based on
spatial-temporal features na. For this evaluation, we only
considered side-view videos and report the accuracy of the
benchmark system for the same set of videos as reported in
the original study. Sample segmentation results are shown
in Figure 5 for the ADL, MPII and Breakfast datasets.

4.4. Activity classification 

Here, we evaluated the accuracy of the proposed
system for activity classification (Table 4). We only
considered datasets that provide multiple activity classes (i.e.
ADL, Olympic and Breakfast datasets). Consistent with
earlier experiments, the accuracy of the proposed system
is below the state of the art for smaller datasets (e.g. ADL
and Olympic Sports) but outperforms the state of the art
when enough training samples are available (e.g. Breakfast
dataset).


Activity classification
GMM =   ADL        Olympics     BF
16      86.0       74.4         71.5
32      86.7       76.8         72.2
64      91.3       77.6         73.3
128     94.7       77.2         68.6
256     87.3       74.4         66.4
Best    98.7 El    90.2 [34]    40.5 d

Table 4: Activity classification results.


5. Conclusion 

In this paper, we studied how different feature
representations affect the performance of a structured generative
(temporal) model based on the HTK framework. We
performed a systematic evaluation of the proposed approach
and compared the accuracy of the resulting system against
the state of the art for both activity segmentation and
classification. Our results showed that combining a compact
video representation based on Fisher Vectors with Hidden
Markov Models yields very significant gains in
accuracy for both the recognition of goal-oriented activities and
their parsing at the level of task-oriented action units.
Indeed, when sufficient training data was available, we found
that structured generative temporal models outperform the
state of the art. These results are consistent with recent
trends in other areas of computer vision suggesting that, as
datasets are becoming increasingly large, structured models
are starting to outperform the state of the art.

Figure 5: Sample segmentation results for a) the ADL dataset (“dial phone”), b) the MPII cooking dataset (“prepare
cold drink”), and c) the Breakfast dataset (“prepare scrambled eggs”). The upper/lower color bars correspond to
ground-truth/system outputs, respectively.